loads the main libraries (OpenAI SDK, Altair, Plotly, …)
looks for your OpenAI key in key/openai_key.txt
initialises a client (o3 by default – flip to o3-pro for >128 k-token papers)
defines pdf_to_string()
– drops the reference section and truncates at 180 k tokens so we stay in-context.
Show code
import os, pathlib, json, textwrap, pdfplumber, re, pandas as pd, numpy as npfrom typing import Dict, Anyfrom openai import OpenAIimport altair as altalt.renderers.enable("html") import plotly.io as pioimport plotly.graph_objects as go# # Install if it isn’t present # %py -3.13 -m pip install --upgrade openai# # -------------------------------------------------------------------# Locate API key: env var ➜ key/openai_key.txt# -------------------------------------------------------------------key_path = pathlib.Path("key/openai_key.txt")if os.getenv("OPENAI_API_KEY") isNoneand key_path.exists(): os.environ["OPENAI_API_KEY"] = key_path.read_text().strip()ifnot os.getenv("OPENAI_API_KEY"):raiseValueError("No API key found.\n""Create key/openai_key.txt (single line) or export OPENAI_API_KEY in your shell." )client = OpenAI() # SDK reads the key from the env varmodel ="o3"# model = "o3-pro"# ------------------------------# PDF → plain‑text utility# ------------------------------import re, pdfplumberdef pdf_to_string(path, max_tokens=180_000, model="o3-pro"):"""Extract text, drop refs, hard-cap by tokens."""import tiktoken enc = tiktoken.encoding_for_model(model)with pdfplumber.open(path) as pdf: text =" ".join(p.extract_text() or""for p in pdf.pages)# kill excessive whitespace text = re.sub(r"\s+", " ", text)# drop everything after References / Bibliography m = re.search(r"\b(References|Bibliography)\b", text, flags=re.I)if m: text = text[: m.start()]# token-trim tokens = enc.encode(text)iflen(tokens) > max_tokens: text = enc.decode(tokens[:max_tokens])return text
Response schema & system prompt
METRICS lists the seven Unjournal criteria.
response_format is a JSON-Schema guard so the model can’t hallucinate keys.
SYSTEM_PROMPT is the rubric we give the LLM.
Notes:
• Scores are percentiles vs serious work in the last 3 years.
• overall defaults to the mean of the other six but the model may override.
evaluate_paper() wraps everything: PDF → text → chat completion → dict.
Show code
# -----------------------------# 1. Metric list# -----------------------------METRICS = ["overall","claims_evidence","methods","advancing_knowledge","logic_communication","open_science","global_relevance"]# -----------------------------# 2. JSON schema# -----------------------------metric_schema = {"type": "object","properties": {"midpoint": {"type": "integer", "minimum": 0, "maximum": 100},"lower_bound": {"type": "integer", "minimum": 0, "maximum": 100},"upper_bound": {"type": "integer", "minimum": 0, "maximum": 100},"rationale": {"type": "string"} },"required": ["midpoint", "lower_bound", "upper_bound", "rationale"],"additionalProperties": False}response_format = {"type": "json_schema","json_schema": {"name": "paper_assessment_v1","strict": True,"schema": {"type": "object","properties": {"metrics": {"type": "object","properties": {m: metric_schema for m in METRICS},"required": METRICS,"additionalProperties": False } },"required": ["metrics"],"additionalProperties": False } }}# -----------------------------# 3. System prompt# -----------------------------SYSTEM_PROMPT = textwrap.dedent(f"""You are an expert evaluator.We ask for a set of nine quantitative metrics. For each metric, we ask for a score and a 90% credible interval.Percentile rankingsWe ask for a percentile ranking from 0-100%. This represents "what proportion of papers in the reference group are worse than this paper, by this criterion". A score of 100% means this is essentially the best paper in the reference group. 0% is the worst paper. A score of 50% means this is the median paper; i.e., half of all papers in the reference group do this better, and half do this worse, and so on.The population of papers should be all serious research in the same area that you have encountered in the last three years.Midpoint rating and credible intervals For each metric, we ask you to provide a 'midpoint rating' and a 90% credible interval as a measure of your uncertainty. We want policymakers, researchers, funders, and managers to be able to use evaluations to update their beliefs and make better decisions. Evaluators may feel confident about their rating for one category, but less confident in another area. How much weight should readers give to each? In this context, it is useful to quantify the uncertainty. You are asked to give a 'midpoint' and a 90% credible interval. Consider this as the smallest interval that you believe is 90% likely to contain the true value.Overall assessment- Judge the quality of the research heuristically. Consider all aspects of quality, credibility, importance to future impactful applied research, and practical relevance and usefulness.importance to knowledge production, and importance to practice. Claims, strength and characterization of evidence- Do the authors do a good job of (i) stating their main questions and claims, (ii) providing strong evidence and powerful approaches to inform these, and (iii) correctly characterizing the nature of their evidence?Methods: Justification, reasonableness, validity, robustness- Are the methods used well-justified and explained; are they a reasonable approach to answering the question(s) in this context? Are the underlying assumptions reasonable? - Are the results and methods likely to be robust to reasonable changes in the underlying assumptions? Does the author demonstrate this?- Avoiding bias and questionable research practices (QRP): Did the authors take steps to reduce bias from opportunistic reporting and QRP? For example, did they do a strong pre-registration and pre-analysis plan, incorporate multiple hypothesis testing corrections, and report flexible specifications? Advancing our knowledge and practice- To what extent does the project contribute to the field or to practice, particularly in ways that are relevant to global priorities and impactful interventions?- Do the paper's insights inform our beliefs about important parameters and about the effectiveness of interventions? - Does the project add useful value to other impactful research? We don't require surprising results; sound and well-presented null results can also be valuable.Logic and communication- Are the goals and questions of the paper clearly expressed? Are concepts clearly defined and referenced?- Is the reasoning "transparent"? Are assumptions made explicit? Are all logical steps clear and correct? Does the writing make the argument easy to follow?- Are the conclusions consistent with the evidence (or formal proofs) presented? Do the authors accurately state the nature of their evidence, and the extent it supports their main claims? - Are the data and/or analysis presented relevant to the arguments made? Are the tables, graphs, and diagrams easy to understand in the context of the narrative (e.g., no major errors in labeling)?Open, collaborative, replicable research- Replicability, reproducibility, data integrity: Would another researcher be able to perform the same analysis and get the same results? Are the methods explained clearly and in enough detail to enable easy and credible replication? For example, are all analyses and statistical tests explained, and is code provided?- Is the source of the data clear? Is the data made as available as is reasonably possible? If so, is it clearly labeled and explained?? - Consistency: Do the numbers in the paper and/or code output make sense? Are they internally consistent throughout the paper?- Useful building blocks: Do the authors provide tools, resources, data, and outputs that might enable or enhance future work and meta-analysis?Relevance to global priorities, usefulness for practitioners- Are the paper’s chosen topic and approach likely to be useful to global priorities, cause prioritization, and high-impact interventions? - Does the paper consider real-world relevance and deal with policy and implementation questions? Are the setup, assumptions, and focus realistic? - Do the authors report results that are relevant to practitioners? Do they provide useful quantified estimates (costs, benefits, etc.) enabling practical impact quantification and prioritization? - Do they communicate (at least in the abstract or introduction) in ways policymakers and decision-makers can understand, without misleading or oversimplifying?Return STRICT JSON matching the supplied schema.Fill every key in the object `metrics`:{', '.join(METRICS)}Definitions are percentile scores (0 – 100) versus serious work in the field from the last 3 years.For `overall`: • Default = arithmetic mean of the other six midpoints (rounded). • If, in your judgment, another value is better (e.g. one metric is far more decision-relevant), choose it **and explain why** in `overall.rationale`.Field meanings midpoint → best-guess percentile lower_bound → 5th-percentile plausible value upper_bound → 95th-percentile plausible value rationale → ≤100 words; terse but informative.Do **not** wrap the JSON in markdown fences or add extra text.""").strip()def evaluate_paper(pdf_path: str| pathlib.Path, model: str= model) ->dict: paper_text = pdf_to_string(pdf_path) chat = client.chat.completions.create( model=model,# temperature=temperature, response_format=response_format, messages=[ {"role": "system", "content": SYSTEM_PROMPT}, {"role": "user", "content": paper_text} ] ) raw_json = chat.choices[0].message.contentreturn json.loads(raw_json)
Batch-evaluate all PDFs
Below we iterate over every file in papers/, call evaluate_paper(), sleep 1.1 s (rate-limit padding), and store the full JSON per paper in results/.
Results
Flattens the nested JSON into long format:
one row = (paper, metric, midpoint, lower, upper, rationale).
Saves results/metrics_long.csv for quick downstream use.
Registers an Altair + Plotly “Unjournal” theme so all charts use
– brand green #99bb66 and accent orange #f19e4b
Ridgeline density plot: Kernel-density of mid-points for each metric. Height is normalised, x-axis is 0-100 %. Green fill matches the site palette.
# Quantitative metrics### SetupThe block below* loads the main libraries (OpenAI SDK, Altair, Plotly, …)* looks for your OpenAI key in **key/openai_key.txt** * initialises a client (`o3` by default – flip to `o3-pro` for >128 k-token papers)* defines `pdf_to_string()` – drops the reference section and truncates at 180 k tokens so we stay in-context.```{python}#| label: "API setup"import os, pathlib, json, textwrap, pdfplumber, re, pandas as pd, numpy as npfrom typing import Dict, Anyfrom openai import OpenAIimport altair as altalt.renderers.enable("html") import plotly.io as pioimport plotly.graph_objects as go# # Install if it isn’t present # %py -3.13 -m pip install --upgrade openai# # -------------------------------------------------------------------# Locate API key: env var ➜ key/openai_key.txt# -------------------------------------------------------------------key_path = pathlib.Path("key/openai_key.txt")if os.getenv("OPENAI_API_KEY") isNoneand key_path.exists(): os.environ["OPENAI_API_KEY"] = key_path.read_text().strip()ifnot os.getenv("OPENAI_API_KEY"):raiseValueError("No API key found.\n""Create key/openai_key.txt (single line) or export OPENAI_API_KEY in your shell." )client = OpenAI() # SDK reads the key from the env varmodel ="o3"# model = "o3-pro"# ------------------------------# PDF → plain‑text utility# ------------------------------import re, pdfplumberdef pdf_to_string(path, max_tokens=180_000, model="o3-pro"):"""Extract text, drop refs, hard-cap by tokens."""import tiktoken enc = tiktoken.encoding_for_model(model)with pdfplumber.open(path) as pdf: text =" ".join(p.extract_text() or""for p in pdf.pages)# kill excessive whitespace text = re.sub(r"\s+", " ", text)# drop everything after References / Bibliography m = re.search(r"\b(References|Bibliography)\b", text, flags=re.I)if m: text = text[: m.start()]# token-trim tokens = enc.encode(text)iflen(tokens) > max_tokens: text = enc.decode(tokens[:max_tokens])return text```### Response schema & system prompt* `METRICS` lists the seven Unjournal criteria. * `response_format` is a JSON-Schema guard so the model can’t hallucinate keys. * `SYSTEM_PROMPT` is the rubric we give the LLM. Notes: • Scores are percentiles vs serious work in the last 3 years. • `overall` defaults to the mean of the other six but the model may override.* `evaluate_paper()` wraps everything: PDF → text → chat completion → `dict`.```{python}#| label: "Response schema and system prompt"# -----------------------------# 1. Metric list# -----------------------------METRICS = ["overall","claims_evidence","methods","advancing_knowledge","logic_communication","open_science","global_relevance"]# -----------------------------# 2. JSON schema# -----------------------------metric_schema = {"type": "object","properties": {"midpoint": {"type": "integer", "minimum": 0, "maximum": 100},"lower_bound": {"type": "integer", "minimum": 0, "maximum": 100},"upper_bound": {"type": "integer", "minimum": 0, "maximum": 100},"rationale": {"type": "string"} },"required": ["midpoint", "lower_bound", "upper_bound", "rationale"],"additionalProperties": False}response_format = {"type": "json_schema","json_schema": {"name": "paper_assessment_v1","strict": True,"schema": {"type": "object","properties": {"metrics": {"type": "object","properties": {m: metric_schema for m in METRICS},"required": METRICS,"additionalProperties": False } },"required": ["metrics"],"additionalProperties": False } }}# -----------------------------# 3. System prompt# -----------------------------SYSTEM_PROMPT = textwrap.dedent(f"""You are an expert evaluator.We ask for a set of nine quantitative metrics. For each metric, we ask for a score and a 90% credible interval.Percentile rankingsWe ask for a percentile ranking from 0-100%. This represents "what proportion of papers in the reference group are worse than this paper, by this criterion". A score of 100% means this is essentially the best paper in the reference group. 0% is the worst paper. A score of 50% means this is the median paper; i.e., half of all papers in the reference group do this better, and half do this worse, and so on.The population of papers should be all serious research in the same area that you have encountered in the last three years.Midpoint rating and credible intervals For each metric, we ask you to provide a 'midpoint rating' and a 90% credible interval as a measure of your uncertainty. We want policymakers, researchers, funders, and managers to be able to use evaluations to update their beliefs and make better decisions. Evaluators may feel confident about their rating for one category, but less confident in another area. How much weight should readers give to each? In this context, it is useful to quantify the uncertainty. You are asked to give a 'midpoint' and a 90% credible interval. Consider this as the smallest interval that you believe is 90% likely to contain the true value.Overall assessment- Judge the quality of the research heuristically. Consider all aspects of quality, credibility, importance to future impactful applied research, and practical relevance and usefulness.importance to knowledge production, and importance to practice. Claims, strength and characterization of evidence- Do the authors do a good job of (i) stating their main questions and claims, (ii) providing strong evidence and powerful approaches to inform these, and (iii) correctly characterizing the nature of their evidence?Methods: Justification, reasonableness, validity, robustness- Are the methods used well-justified and explained; are they a reasonable approach to answering the question(s) in this context? Are the underlying assumptions reasonable? - Are the results and methods likely to be robust to reasonable changes in the underlying assumptions? Does the author demonstrate this?- Avoiding bias and questionable research practices (QRP): Did the authors take steps to reduce bias from opportunistic reporting and QRP? For example, did they do a strong pre-registration and pre-analysis plan, incorporate multiple hypothesis testing corrections, and report flexible specifications? Advancing our knowledge and practice- To what extent does the project contribute to the field or to practice, particularly in ways that are relevant to global priorities and impactful interventions?- Do the paper's insights inform our beliefs about important parameters and about the effectiveness of interventions? - Does the project add useful value to other impactful research? We don't require surprising results; sound and well-presented null results can also be valuable.Logic and communication- Are the goals and questions of the paper clearly expressed? Are concepts clearly defined and referenced?- Is the reasoning "transparent"? Are assumptions made explicit? Are all logical steps clear and correct? Does the writing make the argument easy to follow?- Are the conclusions consistent with the evidence (or formal proofs) presented? Do the authors accurately state the nature of their evidence, and the extent it supports their main claims? - Are the data and/or analysis presented relevant to the arguments made? Are the tables, graphs, and diagrams easy to understand in the context of the narrative (e.g., no major errors in labeling)?Open, collaborative, replicable research- Replicability, reproducibility, data integrity: Would another researcher be able to perform the same analysis and get the same results? Are the methods explained clearly and in enough detail to enable easy and credible replication? For example, are all analyses and statistical tests explained, and is code provided?- Is the source of the data clear? Is the data made as available as is reasonably possible? If so, is it clearly labeled and explained?? - Consistency: Do the numbers in the paper and/or code output make sense? Are they internally consistent throughout the paper?- Useful building blocks: Do the authors provide tools, resources, data, and outputs that might enable or enhance future work and meta-analysis?Relevance to global priorities, usefulness for practitioners- Are the paper’s chosen topic and approach likely to be useful to global priorities, cause prioritization, and high-impact interventions? - Does the paper consider real-world relevance and deal with policy and implementation questions? Are the setup, assumptions, and focus realistic? - Do the authors report results that are relevant to practitioners? Do they provide useful quantified estimates (costs, benefits, etc.) enabling practical impact quantification and prioritization? - Do they communicate (at least in the abstract or introduction) in ways policymakers and decision-makers can understand, without misleading or oversimplifying?Return STRICT JSON matching the supplied schema.Fill every key in the object `metrics`:{', '.join(METRICS)}Definitions are percentile scores (0 – 100) versus serious work in the field from the last 3 years.For `overall`: • Default = arithmetic mean of the other six midpoints (rounded). • If, in your judgment, another value is better (e.g. one metric is far more decision-relevant), choose it **and explain why** in `overall.rationale`.Field meanings midpoint → best-guess percentile lower_bound → 5th-percentile plausible value upper_bound → 95th-percentile plausible value rationale → ≤100 words; terse but informative.Do **not** wrap the JSON in markdown fences or add extra text.""").strip()def evaluate_paper(pdf_path: str| pathlib.Path, model: str= model) ->dict: paper_text = pdf_to_string(pdf_path) chat = client.chat.completions.create( model=model,# temperature=temperature, response_format=response_format, messages=[ {"role": "system", "content": SYSTEM_PROMPT}, {"role": "user", "content": paper_text} ] ) raw_json = chat.choices[0].message.contentreturn json.loads(raw_json)```### Batch-evaluate all PDFsBelow we iterate over every file in **papers/**, call `evaluate_paper()`, sleep 1.1 s(rate-limit padding), and store the full JSON per paper in **results/**.```{python}#| label: eval-many-metrics#| echo: falseimport pathlib, json, time, pandas as pdfrom tqdm import tqdm # progress bar; pip install tqdmROOT = pathlib.Path("papers") # put your PDFs hereOUT = pathlib.Path("results")OUT.mkdir(exist_ok=True)pdfs =sorted(ROOT.glob("*.pdf"))records = []for pdf in tqdm(pdfs, desc="Metrics"):try: res = evaluate_paper(pdf) # <-- your helper defined earlier :contentReference[oaicite:0]{index=0} res["paper"] = pdf.stem records.append(res) time.sleep(1.1) # gentle 90 req/min pacingexceptExceptionas e:print(f"⚠️ {pdf.name}: {e}")```### Results* Flattens the nested JSON into long format: one row = (paper, metric, midpoint, lower, upper, rationale). * Saves **results/metrics_long.csv** for quick downstream use.* Registers an Altair + Plotly “Unjournal” theme so all charts use – brand green `#99bb66` and accent orange `#f19e4b`<!-- – Source Sans Pro font (loaded via _quarto.yml → googlefonts) -->* Ridgeline density plot: Kernel-density of mid-points for each metric. Height is normalised, x-axisis 0-100 %. Green fill matches the site palette.* Heat-map* Interactive Plotly widget```{python}#| label: theme-deftidy_rows = []for rec in records: paper_id = rec["paper"]for metric, vals in rec["metrics"].items(): tidy_rows.append({"paper": paper_id,"metric": metric,**vals # midpoint, lower_bound, upper_bound, rationale })tidy = pd.DataFrame(tidy_rows)tidy.to_csv(OUT /"metrics_long.csv", index=False)# tidy.head()UNJ_GREEN ="#99bb66"# brand greenUNJ_ORANGE ="#f19e4b"# accent# ── 1 ▸ Altair theme ────────────────────────────────────────────def unj_theme():return {"config": {"view": {"stroke": "transparent"},"background": "white","title": {"font": "Source Sans Pro", "fontSize": 16, "color": "#222"},"axis": {"labelFont": "Source Sans Pro", "titleFont": "Source Sans Pro","labelColor": "#222", "titleColor": "#222","gridOpacity": 0.15},"legend": {"labelFont": "Source Sans Pro", "titleFont": "Source Sans Pro"},# continuous colourbars (heatmaps etc.)"range": {"heatmap": ["#f6fbf3", "#e2f1d7", "#cfe7ba", "#badc9c","#a6d27f", "#92c861", "#7dbd43", "#69b325","#55a807", "#477b13"# darker end ],# default nominal palette starts with green→orange"category": [ UNJ_GREEN, UNJ_ORANGE,"#6bb0f3", "#d9534f", "#636363", "#ffb400","#53354a", "#2780e3", "#3fb618", "#8e6c8a" ] } } }alt.themes.register("unj", unj_theme)alt.themes.enable("unj") # ← every Altair chart now uses it# ── 2 ▸ Plotly template ─────────────────────────────────────────pio.templates["unj"] = go.layout.Template( layout =dict( font=dict(family="Source Sans Pro, Helvetica, Arial, sans-serif", size=14, color="#222"), colorway=[UNJ_GREEN, UNJ_ORANGE, "#6bb0f3", "#d9534f", "#636363"], paper_bgcolor="white", plot_bgcolor="white", xaxis=dict(gridcolor="rgba(0,0,0,0.15)"), yaxis=dict(gridcolor="rgba(0,0,0,0.15)") ))pio.templates.default ="unj"``````{python}#| label: fig-ridgeline#| echo: false#| fig-cap: "Mid-point score distributions for every metric. Each density is scaled to unit height; x-axis is the percentile scale (0 – 100 %)."ridgeline = ( alt.Chart(tidy)# compute KDE per metric .transform_density( density='midpoint', groupby=['metric'], as_=['midpoint', 'density'], extent=[0, 100], bandwidth=2# tweak smoothness ) .transform_joinaggregate( maxD='max(density)', groupby=['metric'] )# normalise so each ridge has equal height .transform_calculate( norm='datum.density / datum.maxD' ) .mark_area(opacity=.7) .encode( x=alt.X('midpoint:Q', title='Percentile score', axis=alt.Axis(tickMinStep=10)), y=alt.Y('norm:Q', stack=None, # don’t stack across metrics title=None), row=alt.Row('metric:N', sort=METRICS[::-1], header=alt.Header(labelAngle=0, labelAlign='left')), color=alt.value('#99bb66') # constant colour ) .properties(height=60, width=650))ridgeline``````{python}#| label: fig-heatmap#| fig-cap: "Per-paper mid-point scores across all metrics. Darker green → higher percentile. Columns ordered by each paper’s overall average."if"short"notin tidy.columns:def make_short(name: str) ->str:if re.fullmatch(r"w\d{5}", name, re.I):return name.upper() # e.g. W30539 parts = re.split(r"[_\-\s]+", name) year =next((p for p in parts if re.fullmatch(r"\d{4}", p)), "") auth =next((p for p in parts if p and p[0].isalpha()), name)[:12]returnf"{auth.title()} ({year})"if year else auth.title() tidy["short"] = tidy["paper"].apply(make_short)# ── 1 ▸ reshape & attach short labels ──────────────────────────────heat = (tidy .pivot(index='metric', columns='paper', values='midpoint') .loc[METRICS[::-1]]) # keep row orderheat = (heat .reset_index() .melt(id_vars='metric', var_name='paper', value_name='midpoint') .merge(tidy[['paper','short']].drop_duplicates(), on='paper'))# ── 2 ▸ order papers by overall mean score (best → worst) ─────────order = (tidy.groupby('paper')['midpoint'] .mean() .sort_values(ascending=False) .index.to_list())order_short = (tidy[['paper','short']] .drop_duplicates() .set_index('paper') .loc[order]['short'] .to_list())# ── 3 ▸ build chart ────────────────────────────────────────────────alt.Chart(heat).mark_rect().encode( y = alt.Y('metric:N', sort=None, title=None), x = alt.X('short:N', sort=order_short, title=None), color = alt.Color('midpoint:Q', scale = alt.Scale(domain=[0,100]), # uses theme’s green gradient legend = alt.Legend(title='Score')), tooltip = ['short','metric','midpoint']).properties( height =40*len(METRICS), width =14* heat['short'].nunique())``````{python}#| label: fig-paper-widget#| fig-cap: "Interactive inspector: select a single paper to show its mid-point (dot) and 90 % credible interval (whisker) for every metric."# 1 ─── data prep ----------------------------------------------------METRICS_REV = METRICS[::-1] # want first metric on toptidy_sorted = (tidy .assign(metric=pd.Categorical(tidy.metric, categories=METRICS_REV, ordered=True)) .sort_values("metric"))papers = ["All"] +sorted(tidy.paper.unique())# 2 ─── build one trace per paper -----------------------------------fig = go.Figure()for pap in tidy.paper.unique(): dfp = tidy_sorted[tidy_sorted.paper == pap] fig.add_trace(go.Scatter( x=dfp.midpoint, y=dfp.metric, mode="markers", marker=dict(size=10), error_x=dict(type="data", symmetric=False, array=dfp.upper_bound - dfp.midpoint, arrayminus=dfp.midpoint - dfp.lower_bound), name=pap, visible=False# show later via dropdown ))# 3 ─── dropdown logic ----------------------------------------------buttons = []# individual papersfor i, pap inenumerate(tidy.paper.unique()): vis = [False] *len(tidy.paper.unique()) vis[i] =True buttons.append(dict( label=pap, method="update", args=[{"visible": vis}, {"title": f"Ratings for paper: {pap}"}] ))# 'All' view (dots overlaid)buttons.insert(0, dict( label="All", method="update", args=[{"visible": [True]*len(tidy.paper.unique())}, {"title": "Ratings for all papers"}]))fig.update_layout( updatemenus=[dict( buttons=buttons, direction="down", showactive=True, x=0.01, y=1.15 )], yaxis=dict(categoryorder="array", categoryarray=METRICS_REV, title=None), xaxis=dict(range=[0, 100], title="Percentile score"), height=40*len(METRICS) +120, title="Ratings for all papers")fig```